Phrase-Based Document Categorization

نویسندگان

  • Cornelis H. A. Koster
  • Jean Beney
  • Suzan Verberne
  • Merijn Vogel
چکیده

(Chapter in Springer book ”Current Challenges in Patent Information Retrieval”, to appear in May 2011) This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. In Information Retrieval, until now there was found little or no evidence that document categorization benefits from the application of linguistics techniques. Classification algorithms using the most cleverly designed linguistic representations typically did not perform better than those using simply the bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Loose Phrase String Kernels

When representing textual documents by feature vectors for the purposes of further processing (e.g. for categorization, clustering, or visualization), one possible representation is based on “loose phrases” (also known as “proximity features”). This is a generalization of n-grams: a loose phrase is considered to appear in a document if all the words from the phrase occur sufficiently close to e...

متن کامل

AuTopicGen: Rule based Positional Pattern Approach for Topic Collection in IR

IR systems consist of phases like document preprocessing, indexing, query expansion, query matching, ranking etc. The document preprocessing phase is the most important phase to parse the document and collect keywords. Relevance of overall IR system improves if main topics of document are perfectly identified during this phase. It is a known fact that Topics are mostly phrase based. Existing ph...

متن کامل

Learning Monolingual Compositional Representations via Bilingual Supervision

Bilingual models that capture the semantics of sentences are typically only evaluated on cross-lingual transfer tasks such as cross-lingual document categorization or machine translation. In this work, we evaluate the quality of the monolingual representations learned with a variant of the bilingual compositional model of Hermann and Blunsom (2014), when viewing translations in a second languag...

متن کامل

Using Noun Phrase Heads to Extract Document Keyphrases

Automatically extracting keyphrases from documents is a task with many applications in information retrieval and natural language processing. Document retrieval can be biased towards documents containing relevant keyphrases; documents can be classified or categorized based on their keyphrases; automatic text summarization may extract sentences with high keyphrase scores. This paper describes a ...

متن کامل

Text Mining

“Bag of words” model, acronym extraction, authorship ascription, coordinate matching, data mining, document clustering, document frequency, document retrieval, document similarity metrics, entity extraction, hidden Markov models, hubs and authorities, information extraction, information retrieval, key-phrase assignment, key-phrase extraction, knowledge engineering, language identification, link...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011